Batch: June 2022
Data Analytics with Python
Task: Perform data analysis on India's state-wise COVID-19 data using various visualization tools.
Import libraries
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
data= pd.read_csv('Covid-19_India.csv')
data.head()
| | State/UTs | Total Cases | Active | Discharged | Deaths | Active Ratio | Discharge Ratio | Death Ratio | Population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Andaman and Nicobar | 10039 | 0 | 9910 | 129 | 0.00 | 98.72 | 1.28 | 380581 |
| 1 | Andhra Pradesh | 2319869 | 62 | 2305076 | 14731 | 0.00 | 99.36 | 0.63 | 49577103 |
| 2 | Arunachal Pradesh | 64504 | 1 | 64207 | 296 | 0.00 | 99.54 | 0.46 | 1383727 |
| 3 | Assam | 724225 | 4 | 716235 | 7986 | 0.00 | 98.90 | 1.10 | 31205576 |
| 4 | Bihar | 830702 | 43 | 818403 | 12256 | 0.01 | 98.52 | 1.48 | 104099452 |
data.rename(columns = {'Active Ratio':'Active Ratio (%)', 'Discharge Ratio': 'Discharge Ratio (%)', 'Death Ratio': 'Death Ratio (%)'}, inplace = True)
data.dtypes
State/UTs               object
Total Cases              int64
Active                   int64
Discharged               int64
Deaths                   int64
Active Ratio (%)       float64
Discharge Ratio (%)    float64
Death Ratio (%)        float64
Population               int64
dtype: object
data.columns
Index(['State/UTs', 'Total Cases', 'Active', 'Discharged', 'Deaths',
'Active Ratio (%)', 'Discharge Ratio (%)', 'Death Ratio (%)',
'Population'],
dtype='object')
data.isnull().sum()
State/UTs              0
Total Cases            0
Active                 0
Discharged             0
Deaths                 0
Active Ratio (%)       0
Discharge Ratio (%)    0
Death Ratio (%)        0
Population             0
dtype: int64
data.duplicated().sum()
0
data.shape
(36, 9)
data.size
324
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   State/UTs            36 non-null     object
 1   Total Cases          36 non-null     int64
 2   Active               36 non-null     int64
 3   Discharged           36 non-null     int64
 4   Deaths               36 non-null     int64
 5   Active Ratio (%)     36 non-null     float64
 6   Discharge Ratio (%)  36 non-null     float64
 7   Death Ratio (%)      36 non-null     float64
 8   Population           36 non-null     int64
dtypes: float64(3), int64(5), object(1)
memory usage: 2.7+ KB
data.describe()
| | Total Cases | Active | Discharged | Deaths | Active Ratio (%) | Discharge Ratio (%) | Death Ratio (%) | Population |
|---|---|---|---|---|---|---|---|---|
| count | 3.600000e+01 | 36.000000 | 3.600000e+01 | 36.000000 | 36.000000 | 36.000000 | 36.000000 | 3.600000e+01 |
| mean | 1.198233e+06 | 415.416667 | 1.183250e+06 | 14567.027778 | 0.026389 | 98.845000 | 1.128056 | 3.362689e+07 |
| std | 1.771219e+06 | 797.343424 | 1.745163e+06 | 26960.635812 | 0.033050 | 0.494221 | 0.491440 | 4.305758e+07 |
| min | 1.003900e+04 | 0.000000 | 9.910000e+03 | 4.000000 | 0.000000 | 97.650000 | 0.030000 | 6.447300e+04 |
| 25% | 9.912050e+04 | 4.000000 | 9.802975e+04 | 1104.500000 | 0.000000 | 98.520000 | 0.870000 | 1.439840e+06 |
| 50% | 5.892080e+05 | 69.000000 | 5.828095e+05 | 6505.500000 | 0.010000 | 98.860000 | 1.100000 | 2.106970e+07 |
| 75% | 1.285924e+06 | 385.250000 | 1.276082e+06 | 14208.250000 | 0.040000 | 99.112500 | 1.397500 | 5.229275e+07 |
| max | 7.882476e+06 | 3799.000000 | 7.732792e+06 | 147856.000000 | 0.120000 | 99.970000 | 2.340000 | 1.998123e+08 |
Numeric Correlations
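# Restrict to numeric columns: recent pandas versions raise an error if the text column 'State/UTs' is included
data.select_dtypes('number').corr()
| | Total Cases | Active | Discharged | Deaths | Active Ratio (%) | Discharge Ratio (%) | Death Ratio (%) | Population |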
|---|---|---|---|---|---|---|---|---|
| Total Cases | 1.000000 | 0.793212 | 0.999987 | 0.943813 | 0.164050 | -0.135375 | 0.125179 | 0.533583 |
| Active | 0.793212 | 1.000000 | 0.794030 | 0.684131 | 0.535411 | -0.121842 | 0.086232 | 0.277804 |
| Discharged | 0.999987 | 0.794030 | 1.000000 | 0.942144 | 0.164359 | -0.133217 | 0.122980 | 0.533843 |
| Deaths | 0.943813 | 0.684131 | 0.942144 | 1.000000 | 0.122715 | -0.266886 | 0.260741 | 0.490676 |
| Active Ratio (%) | 0.164050 | 0.535411 | 0.164359 | 0.122715 | 1.000000 | -0.147545 | 0.079946 | 0.040004 |
| Discharge Ratio (%) | -0.135375 | -0.121842 | -0.133217 | -0.266886 | -0.147545 | 1.000000 | -0.997654 | -0.058038 |
| Death Ratio (%) | 0.125179 | 0.086232 | 0.122980 | 0.260741 | 0.079946 | -0.997654 | 1.000000 | 0.057243 |
| Population | 0.533583 | 0.277804 | 0.533843 | 0.490676 | 0.040004 | -0.058038 | 0.057243 | 1.000000 |
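Correlation using Heatmap
plt.figure(figsize=(6,6))
# Correlate numeric columns only; the text column 'State/UTs' cannot be correlated
sns.heatmap(data.select_dtypes('number').corr())
plt.title("Correlation Between the Data Columns")
Text(0.5, 1.0, 'Correlation Between the Data Columns')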
Correlation using Scatter Matrix
scatter_matrix(data,figsize=[15,15],diagonal='kde')
plt.show()
Conclusion: The plots and the correlation table show that total cases and discharged cases are almost perfectly correlated (r ≈ 1.00), which means that nearly everyone who was infected has recovered; the discharge ratio lies between 97.65% and 99.97% in every state/UT. The death ratio and the discharge ratio are strongly negatively correlated (r ≈ -1.00). This is expected: the active ratio is close to zero and the three ratios add up to 100%, so a higher discharge ratio necessarily means a lower death ratio. Deaths are also strongly correlated with total cases (r ≈ 0.94), because states where the virus spread more widely recorded more deaths in absolute terms.
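The relationship between the discharge and death ratios can be checked directly on the five rows printed by `data.head()`. The sketch below rebuilds those rows by hand (the name `sample` is illustrative, and the ratios are recomputed from the counts rather than read from the file), then confirms that the three counts add up to the total and that the recomputed ratios therefore sum to 100.

```python
import pandas as pd

# The five rows shown by data.head(), typed in by hand for illustration.
sample = pd.DataFrame({
    "State/UTs": ["Andaman and Nicobar", "Andhra Pradesh", "Arunachal Pradesh", "Assam", "Bihar"],
    "Total Cases": [10039, 2319869, 64504, 724225, 830702],
    "Active": [0, 62, 1, 4, 43],
    "Discharged": [9910, 2305076, 64207, 716235, 818403],
    "Deaths": [129, 14731, 296, 7986, 12256],
})

# Every case is either active, discharged, or a death.
counts_add_up = (
    sample["Active"] + sample["Discharged"] + sample["Deaths"] == sample["Total Cases"]
).all()

# Recompute the three ratios (in percent) from the raw counts.
ratios = sample[["Active", "Discharged", "Deaths"]].div(sample["Total Cases"], axis=0) * 100
ratio_sums = ratios.sum(axis=1)

# With the active ratio near zero, discharge ratio is roughly 100 minus death ratio.
r = ratios["Discharged"].corr(ratios["Deaths"])

print(counts_add_up)
print(ratio_sums.round(6).tolist())
print(r)
```

Because the discharge ratio is approximately 100 minus the death ratio, the strong negative correlation between them is largely arithmetic and should not be read as independent evidence.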
#Getting the different attributes of the dataset
attributes=list(data.columns)
attributes.remove('State/UTs')
attributes
['Total Cases', 'Active', 'Discharged', 'Deaths', 'Active Ratio (%)', 'Discharge Ratio (%)', 'Death Ratio (%)', 'Population']
for attribute in attributes:
    # Sort the states/UTs in descending order of the current attribute
    data = data.sort_values(attribute, ascending=False)
    plt.figure(figsize=(10, 10))
    sns.barplot(x='State/UTs', y=attribute, palette='CMRmap', data=data)
    plt.title(attribute + " per State/UTs", fontsize=15)
    plt.xticks(rotation=90)
    plt.show()
Conclusion: Maharashtra has the highest number of total cases, and Andaman and Nicobar has the lowest. Kerala has the highest number of active cases. Maharashtra has a large number of discharged people, but its number of deaths is large as well, which is probably related to the large, dense populations of its cities. Haryana has the highest active ratio and Punjab has the highest death ratio, while the discharge ratio is nearly the same (roughly 98% to 100%) in every state/union territory. The last plot shows the population of each state/union territory.
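Since absolute counts depend heavily on how many people live in a state, one complementary view is to scale cases and deaths by the `Population` column. The sketch below does this for the five rows shown by `data.head()`; the column names `Cases per 100k` and `Deaths per 100k` are introduced here for illustration and are not part of the original dataset.

```python
import pandas as pd

# The five rows shown by data.head(), typed in by hand for illustration.
sample = pd.DataFrame({
    "State/UTs": ["Andaman and Nicobar", "Andhra Pradesh", "Arunachal Pradesh", "Assam", "Bihar"],
    "Total Cases": [10039, 2319869, 64504, 724225, 830702],
    "Deaths": [129, 14731, 296, 7986, 12256],
    "Population": [380581, 49577103, 1383727, 31205576, 104099452],
}).set_index("State/UTs")

# Illustrative derived columns: counts per 100,000 inhabitants.
sample["Cases per 100k"] = sample["Total Cases"] / sample["Population"] * 100_000
sample["Deaths per 100k"] = sample["Deaths"] / sample["Population"] * 100_000

# Rank the states/UTs by cases relative to population.
ranked = sample.sort_values("Cases per 100k", ascending=False)
print(ranked[["Cases per 100k", "Deaths per 100k"]].round(1))
```

On these five rows, Bihar has the largest population but the fewest cases per 100,000 inhabitants, which shows how the ranking can change once population is taken into account.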
for attribute in attributes:
    # Draw one choropleth map of India per attribute
    fig = px.choropleth(
        data,
        geojson="https://gist.githubusercontent.com/jbrobst/56c13bbbf9d97d187fea01ca62ea5112/raw/e388c4cae20aa53cb5090210a42ebb9b765c0a36/india_states.geojson",
        featureidkey='properties.ST_NM',
        locations='State/UTs',
        color=attribute,
        color_continuous_scale='Inferno',
        title=attribute + ' per State / UTs',
        height=700
    )
    # Zoom to the states present in the data and hide the base map
    fig.update_geos(fitbounds="locations", visible=False)
    fig.show()
# Scroll Down From Here >>>>>>>
The maps above show the same comparative relationship between the states/UTs and each attribute as the bar plots, but a map makes it easier to see how the virus may have spread geographically across the country. The limitation of these plots is that the smaller union territories are hard to see.
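A related practical point: `px.choropleth` colors a region only when its `locations` value exactly matches the `featureidkey` property of a GeoJSON feature, so a spelling difference leaves that region blank without raising an error. The sketch below shows one way to list unmatched names before plotting. The `sample_geojson` dictionary is a hand-written stand-in with hypothetical spellings, not the contents of the real `india_states.geojson` file.

```python
# Hand-written stand-in for the downloaded GeoJSON; the spellings are hypothetical.
sample_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"ST_NM": "Andaman & Nicobar"}, "geometry": None},
        {"type": "Feature", "properties": {"ST_NM": "Assam"}, "geometry": None},
        {"type": "Feature", "properties": {"ST_NM": "Bihar"}, "geometry": None},
    ],
}

# Names as they appear in the 'State/UTs' column of the dataset.
state_names = ["Andaman and Nicobar", "Assam", "Bihar"]

# Collect the property that featureidkey='properties.ST_NM' points at.
geo_names = {feature["properties"]["ST_NM"] for feature in sample_geojson["features"]}

# Any name listed here would not be colored on the map.
unmatched = sorted(name for name in state_names if name not in geo_names)
print(unmatched)
```

With the real file, the same check could be run on the parsed GeoJSON and `data['State/UTs']`, and any unmatched names renamed in the DataFrame before plotting.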
data2=data.sum(axis=0)
data2[['Total Cases','Active','Discharged','Deaths']]
Total Cases    43136371
Active            14955
Discharged     42597003
Deaths           524413
dtype: object
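These totals can be turned into country-level ratios in the same way as the per-state ratio columns. The sketch below uses the four totals printed above; the names `totals` and `national_ratios` are illustrative.

```python
# National totals copied from the output above.
totals = {
    "Total Cases": 43136371,
    "Active": 14955,
    "Discharged": 42597003,
    "Deaths": 524413,
}

# Each count as a percentage of all recorded cases, rounded to two decimals.
national_ratios = {
    name: round(totals[name] / totals["Total Cases"] * 100, 2)
    for name in ("Active", "Discharged", "Deaths")
}
print(national_ratios)
```

The national death ratio computed this way (about 1.22%) is slightly higher than the mean of the state-level death ratios reported by `data.describe()` (about 1.13%), because the national figure weights each state by its number of cases while the mean treats all 36 states/UTs equally.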
The End